About: HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has been using analytics for years. However, the collection, processing and analysis of data has been largely manual, and given the nature of human resources dynamics and HR KPIs, the approach has been constraining HR. Therefore, it is surprising that HR departments woke up to the utility of machine learning so late in the game. Here is an opportunity to try predictive analytics in identifying the employees most likely to get promoted.
Problem Statement: Your client is a large MNC with 9 broad verticals across the organisation. One of the problems your client is facing is identifying the right people for promotion (only for manager positions and below) and preparing them in time. Currently, the process they follow is:
Under the above process, final promotions are announced only after the evaluation, which delays the transition to new roles. Hence, the company needs your help in identifying eligible candidates at a particular checkpoint so that the entire promotion cycle can be expedited.
They have provided multiple attributes around the employee's past and current performance along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.
Attributes Information:
Evaluation Metric: F1-Score.
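Since the leaderboard metric is F1, a minimal sketch (on hypothetical toy labels, not the competition data) of how `sklearn.metrics.f1_score` combines precision and recall:

```python
from sklearn.metrics import f1_score

# Toy labels to illustrate the competition metric:
# F1 = 2 * precision * recall / (precision + recall)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
# TP=2, FP=1, FN=1 -> precision = 2/3, recall = 2/3 -> F1 = 2/3
score = f1_score(y_true, y_pred)
```

Unlike accuracy, F1 ignores true negatives, which is why it is a sensible choice for the rare-positive `is_promoted` target.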
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
import sweetviz as sv
from pandas_profiling import ProfileReport
import statistics
from scipy.stats import skew, norm
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',None)
pd.set_option('display.expand_frame_repr',False)
pd.set_option('display.max_colwidth',None)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.classifier import StackingClassifier
train=pd.read_csv('train_LZdllcl.csv')
test=pd.read_csv('test_2umaH9m.csv')
submission=pd.read_csv('sample_submission_M0L0uXE.csv')
train
test
train.info()
test.info()
We can see there are some NaN values in variables such as education and previous_year_rating in both train and test sets.
print(train.shape)
for i in train.columns.values:
    print(i)
    print(len(train[i].unique()))
    print("----------")
print(test.shape)
for i in test.columns.values:
    print(i)
    print(len(test[i].unique()))
    print("----------")
The levels of the categorical variables match between the train and test sets.
train[train.isnull().any(axis=1)]
test[test.isnull().any(axis=1)]
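A quick per-column count of missing values complements the row view above. A minimal sketch on a hypothetical frame standing in for `train` (the real data has NaNs in `education` and `previous_year_rating`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for `train`.
df = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above"],
    "previous_year_rating": [3.0, 5.0, np.nan],
    "age": [30, 41, 27],
})
missing = df.isnull().sum()          # per-column NaN counts
missing_cols = missing[missing > 0]  # only the columns that need imputation
```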
sns.countplot(x='is_promoted',data=train,palette='bright');
The target variable is highly imbalanced.
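The degree of imbalance can be quantified with `value_counts(normalize=True)`. A sketch on a hypothetical target series mirroring the rarity of `is_promoted`:

```python
import pandas as pd

# Hypothetical target column; the real `is_promoted` is similarly skewed
# towards the negative class.
y = pd.Series([0] * 92 + [1] * 8)
class_share = y.value_counts(normalize=True)
minority_share = class_share.min()
```

A minority share this small is why F1 (rather than accuracy) is the evaluation metric, and why options like `class_weight` and `scale_pos_weight` appear in the models later.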
pandas_profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis; pandas_profiling extends the DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
On train set
profile_train=ProfileReport(train,title='Train',explorative=True)
profile_train
On test set
profile_test=ProfileReport(test,title='Test',explorative=True)
profile_test
On train set
statistics.mode(train['education'])
statistics.mode(train['previous_year_rating'])
train['education'].fillna("Bachelor's",inplace=True)
train['previous_year_rating'].fillna(3.0,inplace=True)
On test set
statistics.mode(test['education'])
statistics.mode(test['previous_year_rating'])
test['education'].fillna("Bachelor's",inplace=True)
test['previous_year_rating'].fillna(3.0,inplace=True)
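The fills above hard-code the modes found by `statistics.mode`. A sketch of the same imputation written generically with `Series.mode()`, so the fill value is derived from the data rather than typed in (hypothetical frame for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; mode()[0] picks the most frequent non-null value.
df = pd.DataFrame({
    "education": ["Bachelor's", "Bachelor's", None, "Master's & above"],
    "previous_year_rating": [3.0, 3.0, 5.0, np.nan],
})
for col in ["education", "previous_year_rating"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```

One caveat: ideally the test set would be filled with the modes computed on the train set, so no information from test leaks into the preprocessing.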
sns.countplot(x='education',data=train,palette='dark');
f,ax=plt.subplots(figsize=(6,4))
sns.countplot(x='department',data=train,palette='dark');
ax.set_xticklabels(ax.get_xticklabels(),rotation=30);
fig,(axis1,axis2)=plt.subplots(1,2,figsize=(15,5));
sns.barplot(x='KPIs_met >80%',y='age',hue='education',palette='rainbow',data=train,ax=axis1);
sns.barplot(x='awards_won?',y='age',hue='education',palette='rainbow',data=train,ax=axis2);
We see similar patterns in an employee's age and education when compared across KPIs_met >80% and awards_won?.
f,ax=plt.subplots(figsize=(20,10))
g=sns.boxplot(x='department',y='avg_training_score',hue='education',data=train,palette='bright');
We can observe that departments like Technology, Analytics and R&D have high average training scores. In both Analytics and Technology, every employee has at least a bachelor's degree. Strangely, in Sales & Marketing the variance is very large even though the minimum education level there is also a bachelor's degree.
f,ax=plt.subplots(figsize=(8,10))
sns.barplot(x='no_of_trainings',y='avg_training_score',hue='education',palette='rainbow',data=train);
Employees with below-secondary education don't have more than four trainings.
f,ax=plt.subplots(figsize=(8,6))
g=sns.boxplot(x='education',y='age',hue='gender',data=train,palette='colorblind');
fig,(axis1,axis2)=plt.subplots(1,2,figsize=(15,5));
sns.stripplot(x='previous_year_rating',y='age',data=train,hue='awards_won?',jitter=True,ax=axis1);
sns.stripplot(x='previous_year_rating',y='avg_training_score',data=train,hue='awards_won?',jitter=True,ax=axis2);
Winning an award has very little to do with age or previous year rating, but it does track training score: the higher the score, the more likely an award is to be won.
g=sns.FacetGrid(data=train,col='recruitment_channel');
g.map(plt.hist,'age');
The age distribution is similar in each panel, but most recruits come through the 'other' channel, slightly fewer through sourcing, and very few through referral.
f,ax=plt.subplots(figsize=(20,10))
sns.stripplot(x='department',y='avg_training_score',data=train,hue='is_promoted',jitter=True,palette='bright');
Clearly a high average training score is a factor in getting promoted, but we can also see some promotions where the training score is low. Surprisingly, the R&D department has the fewest promotions despite good training scores and education levels. The most promotions are seen in Sales & Marketing, Operations and Procurement.
f,ax=plt.subplots(figsize=(20,10))
g=sns.boxplot(x='region',y='avg_training_score',hue='is_promoted',data=train);
ax.set_xticklabels(ax.get_xticklabels(),rotation=30);
Region appears to have little bearing on promotion.
g=sns.stripplot(x='KPIs_met >80%',y='avg_training_score',hue='is_promoted',data=train,palette='deep',jitter=True);
There is a higher chance of getting promoted if the employee falls in the category where more than 80% of KPIs are met.
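That visual impression can be checked numerically with a groupby. A sketch on hypothetical rows standing in for `train` (the real rates will differ, but the comparison is the point):

```python
import pandas as pd

# Hypothetical rows; the claim is that the promotion rate is higher
# when `KPIs_met >80%` equals 1.
df = pd.DataFrame({
    "KPIs_met >80%": [0, 0, 0, 0, 1, 1, 1, 1],
    "is_promoted":   [0, 0, 0, 1, 0, 1, 1, 1],
})
rate_by_kpi = df.groupby("KPIs_met >80%")["is_promoted"].mean()
```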
f,ax=plt.subplots(figsize=(7,7))
sns.stripplot(x='recruitment_channel',y='avg_training_score',hue='is_promoted',palette='BuPu',data=train,jitter=True);
Employees recruited through referral have fewer promotions.
fig,(axis1,axis2)=plt.subplots(1,2,figsize=(15,5));
g=sns.stripplot(x='education',y='length_of_service',hue='gender',data=train,ax=axis1,palette='muted');
g=sns.stripplot(x='department',y='length_of_service',hue='gender',data=train,ax=axis2,palette='muted');
axis2.set_xticklabels(axis2.get_xticklabels(),rotation=30);
In the first plot we see that employees with below-secondary education have very short lengths of service.
Among departments, only Analytics, R&D and Legal have short service lengths, and surprisingly there are very few female employees in those departments.
On train set
sns.set_color_codes(palette='deep')
f,ax=plt.subplots(figsize=(5,5))
print("Skewness: %f" % train['age'].skew())
print("Kurtosis: %f" % train['age'].kurt())
sns.distplot(train['age'],color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="Age")
ax.set(title="Age distribution")
sns.despine(trim=True,left=True)
plt.show();
It is skewed to the right, so we will apply a log(1+x) transformation to reduce the skew.
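Why log1p helps can be seen on synthetic right-skewed data (a sketch with made-up exponential samples, not the age column):

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample; log1p compresses the long right tail,
# pulling the skewness towards zero.
rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=10_000)
skew_before = skew(x)
skew_after = skew(np.log1p(x))
```

`log1p` (log of 1+x) is preferred over a plain log because it is defined at x = 0 and is numerically stable for small values.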
train["age_trans"]=np.log1p(train["age"])
sns.set_color_codes(palette='deep')
f,ax=plt.subplots(figsize=(5,5))
sns.distplot(train['age_trans'],fit=norm,color="b");
print("Skewness: %f" % train['age_trans'].skew())
print("Kurtosis: %f" % train['age_trans'].kurt())
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="Age")
ax.set(title="Age distribution")
sns.despine(trim=True,left=True)
plt.show()
The kurtosis is now -0.0865, which is small, and the skewness is 0.49, which is close to symmetric.
On test set
sns.set_color_codes(palette='deep')
f,ax=plt.subplots(figsize=(5,5))
print("Skewness: %f" % test['age'].skew())
print("Kurtosis: %f" % test['age'].kurt())
sns.distplot(test['age'],color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="Age")
ax.set(title="Age distribution")
sns.despine(trim=True,left=True)
plt.show();
test["age_trans"]=np.log1p(test["age"])
sns.set_color_codes(palette='deep')
f,ax=plt.subplots(figsize=(5,5))
sns.distplot(test['age_trans'],fit=norm,color="b");
print("Skewness: %f" % test['age_trans'].skew())
print("Kurtosis: %f" % test['age_trans'].kurt())
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="Age")
ax.set(title="Age distribution")
sns.despine(trim=True,left=True)
plt.show()
The kurtosis is now -0.0841, which is small, and the skewness is 0.500, which is close to symmetric.
The education attribute is ordinal, so we will map its levels to integers.
On train set
map_edu_tr={"Below Secondary":1,"Bachelor's":2,"Master's & above":3}
train['education_ord']=train['education'].map(map_edu_tr)
On test set
map_edu_te={"Below Secondary":1,"Bachelor's":2,"Master's & above":3}
test['education_ord']=test['education'].map(map_edu_te)
train=train.drop(columns=['education'])
test=test.drop(columns=['education'])
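One thing worth checking after an ordinal `.map` is that no level was missed: `map` silently returns NaN for any category absent from the mapping dict. A sketch on a hypothetical education column:

```python
import pandas as pd

# Hypothetical column; any level missing from the dict would become NaN.
edu = pd.Series(["Below Secondary", "Bachelor's", "Master's & above", "Bachelor's"])
map_edu = {"Below Secondary": 1, "Bachelor's": 2, "Master's & above": 3}
edu_ord = edu.map(map_edu)
unmapped = edu_ord.isnull().sum()  # should be 0 if the mapping is complete
```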
On train set
dummy_tr_1=pd.get_dummies(train['department'],drop_first=True,prefix='dep',prefix_sep='_')
dummy_tr_2=pd.get_dummies(train['gender'],drop_first=True,prefix='gen',prefix_sep='_')
dummy_tr_3=pd.get_dummies(train['recruitment_channel'],drop_first=True,prefix='rc',prefix_sep='_')
dummy_tr_4=pd.get_dummies(train['KPIs_met >80%'],drop_first=True,prefix='kp',prefix_sep='_')
dummy_tr_5=pd.get_dummies(train['awards_won?'],drop_first=True,prefix='aw',prefix_sep='_')
train=pd.concat([train,dummy_tr_1,dummy_tr_2,dummy_tr_3,
dummy_tr_4,dummy_tr_5],axis=1)
train=train.drop(columns=['department','gender','recruitment_channel',
'KPIs_met >80%','awards_won?'])
On test set
dummy_te_1=pd.get_dummies(test['department'],drop_first=True,prefix='dep',prefix_sep='_')
dummy_te_2=pd.get_dummies(test['gender'],drop_first=True,prefix='gen',prefix_sep='_')
dummy_te_3=pd.get_dummies(test['recruitment_channel'],drop_first=True,prefix='rc',prefix_sep='_')
dummy_te_4=pd.get_dummies(test['KPIs_met >80%'],drop_first=True,prefix='kp',prefix_sep='_')
dummy_te_5=pd.get_dummies(test['awards_won?'],drop_first=True,prefix='aw',prefix_sep='_')
test=pd.concat([test,dummy_te_1,dummy_te_2,dummy_te_3,dummy_te_4,dummy_te_5],axis=1)
test=test.drop(columns=['department','gender','recruitment_channel',
'KPIs_met >80%','awards_won?'])
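Calling `get_dummies` separately on train and test works here because the category levels were verified to match, but in general a level missing from one frame silently produces mismatched columns. A defensive sketch (hypothetical frames, with "HR" absent from the test side) using `reindex` to force the test dummies onto the train columns:

```python
import pandas as pd

# Hypothetical frames: the test frame lacks the "HR" category, so its
# dummies would otherwise have fewer columns than the train dummies.
tr = pd.DataFrame({"department": ["Sales", "HR", "Technology"]})
te = pd.DataFrame({"department": ["Sales", "Technology"]})
tr_d = pd.get_dummies(tr["department"], prefix="dep")
te_d = pd.get_dummies(te["department"], prefix="dep")
# Align test to the train columns, filling absent categories with 0.
te_d = te_d.reindex(columns=tr_d.columns, fill_value=0)
```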
For modelling we will split the train set into train and validation sets in a 67-33% proportion. But first we will assign the independent and dependent variables.
X=train.drop(columns=['employee_id','is_promoted','age','region'])
y=train['is_promoted']
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.33,
random_state=44)
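Given the heavy class imbalance noted earlier, `stratify=y` would keep the promotion rate identical in both splits; the split above leaves it to chance. A sketch on toy imbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 20 positives out of 200 samples.
X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 180 + [1] * 20)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.5, random_state=44, stratify=y
)
# stratify=y guarantees the positives are split in exact proportion.
```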
X_train.shape
X_train
classifier_LR=LogisticRegression()
classifier_LR.fit(X_train,y_train)
y_pred_LR_tr=classifier_LR.predict(X_train)
y_pred_LR_val=classifier_LR.predict(X_val)
f1_lr_tr=f1_score(y_train,y_pred_LR_tr)
f1_lr_val=f1_score(y_val,y_pred_LR_val)
print('Train F1-Score')
print(f1_lr_tr*100)
print('Validation F1-Score')
print(f1_lr_val*100)
classifier_KNN=KNeighborsClassifier()
classifier_KNN.fit(X_train,y_train)
y_pred_KNN_tr=classifier_KNN.predict(X_train)
y_pred_KNN_val=classifier_KNN.predict(X_val)
f1_knn_tr=f1_score(y_train,y_pred_KNN_tr)
f1_knn_val=f1_score(y_val,y_pred_KNN_val)
print('Train F1-Score')
print(f1_knn_tr*100)
print('Validation F1-Score')
print(f1_knn_val*100)
classifier_DT=DecisionTreeClassifier()
classifier_DT.fit(X_train,y_train)
y_pred_DT_tr=classifier_DT.predict(X_train)
y_pred_DT_val=classifier_DT.predict(X_val)
f1_dt_tr=f1_score(y_train,y_pred_DT_tr)
f1_dt_val=f1_score(y_val,y_pred_DT_val)
print('Train F1-Score')
print(f1_dt_tr*100)
print('Validation F1-Score')
print(f1_dt_val*100)
classifier_DT_GS=DecisionTreeClassifier()
params_grid_DT={'criterion':['gini','entropy'],
'splitter':['best','random'],
'max_depth':[5,10,15],
'min_samples_split':[2,3,4,5],
'min_samples_leaf':[1,2,3,4],
'class_weight':['balanced',None]}
grid_search_DT=GridSearchCV(classifier_DT_GS,params_grid_DT,
n_jobs=-1,scoring='f1')
grid_search_DT.fit(X_train,y_train)
grid_search_DT.best_params_
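Instead of reading `best_params_` and retyping them into a fresh classifier (as done below), `best_estimator_` already holds the winning tree refit on the full training data. A self-contained sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy search; with the default refit=True, best_estimator_ is the tree
# refit on all the training data with the best parameter combination.
X, y = make_classification(n_samples=200, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4], "criterion": ["gini", "entropy"]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
best_tree = grid.best_estimator_  # ready to predict, no retyping needed
preds = best_tree.predict(X)
```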
classifier_DT_after=DecisionTreeClassifier(class_weight=None,criterion='entropy',max_depth=15,min_samples_leaf=3,min_samples_split=3,splitter='random')
classifier_DT_after.fit(X_train,y_train)
y_pred_DT_after_val=classifier_DT_after.predict(X_val)
y_pred_DT_after_tr=classifier_DT_after.predict(X_train)
f1_dt_tuned_val=f1_score(y_val,y_pred_DT_after_val)
f1_dt_tuned_tr=f1_score(y_train,y_pred_DT_after_tr)
print('Train F1-Score')
print(f1_dt_tuned_tr*100)
print('Validation F1-Score')
print(f1_dt_tuned_val*100)
classifier_RF=RandomForestClassifier()
classifier_RF.fit(X_train,y_train)
y_pred_RF_tr=classifier_RF.predict(X_train)
y_pred_RF_val=classifier_RF.predict(X_val)
f1_rf_tr=f1_score(y_train,y_pred_RF_tr)
f1_rf_val=f1_score(y_val,y_pred_RF_val)
print('Train F1-Score')
print(f1_rf_tr*100)
print('Validation F1-Score')
print(f1_rf_val*100)
gb=GradientBoostingClassifier(n_estimators=300,max_features=0.9,learning_rate=0.25,max_depth=4,
min_samples_leaf=2,subsample=1,verbose=0,random_state=12)
gb.fit(X_train,y_train)
y_pred_gb_val=gb.predict(X_val)
y_pred_gb_tr=gb.predict(X_train)
f1_gb_val=f1_score(y_val,y_pred_gb_val)
f1_gb_tr=f1_score(y_train,y_pred_gb_tr)
print('Train F1-Score')
print(f1_gb_tr*100)
print('Validation F1-Score')
print(f1_gb_val*100)
xgb=XGBClassifier(learning_rate=0.1,n_estimators=150,max_depth=5,min_child_weight=5,gamma=0.3,nthread=8,subsample=0.8,
colsample_bytree=0.8,objective='binary:logistic',scale_pos_weight=3,seed=12)
xgb.fit(X_train,y_train)
y_pred_xgb_val=xgb.predict(X_val)
y_pred_xgb_tr=xgb.predict(X_train)
f1_xgb_val=f1_score(y_val,y_pred_xgb_val)
f1_xgb_tr=f1_score(y_train,y_pred_xgb_tr)
print('Train F1-Score')
print(f1_xgb_tr*100)
print('Validation F1-Score')
print(f1_xgb_val*100)
XGB_GS=XGBClassifier()
params_grid_XGB={'booster':['gbtree','dart'],
                 'learning_rate':[0.01,0.1,1],
                 'gamma':[0.1,0.5,1],
                 'max_depth':[7,8,9,10],
                 'min_child_weight':[1,3,5],
                 'subsample':[0.2,0.4,0.6,0.8],
                 'colsample_bytree':[0.6,0.8,1],
                 # 'feature_selector' applies only to the gblinear booster, so it is dropped here
                 'scale_pos_weight':[2,3,4]}
grid_search_XGB=GridSearchCV(XGB_GS,params_grid_XGB,
n_jobs=-1,scoring='f1')
grid_search_XGB.fit(X_train,y_train)
grid_search_XGB.best_params_
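A grid this size means tens of thousands of fits (the parameter combinations multiply, times the cross-validation folds). `RandomizedSearchCV` samples a fixed budget of candidates from distributions instead. A sketch on toy data, using `GradientBoostingClassifier` as a stand-in so the example has no dependency beyond scikit-learn and scipy:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# n_iter caps the number of candidates regardless of how large the
# search space is; distributions replace the explicit value lists.
X, y = make_classification(n_samples=300, random_state=0)
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    {
        "learning_rate": uniform(0.01, 0.3),
        "max_depth": randint(2, 6),
        "subsample": uniform(0.5, 0.5),
    },
    n_iter=5,          # fixed budget: 5 candidates x cv folds
    scoring="f1",
    cv=3,
    random_state=0,
)
search.fit(X, y)
```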
XGB_after=XGBClassifier(learning_rate=0.01,n_estimators=200,max_depth=5,min_child_weight=5,gamma=0.2,nthread=8,subsample=0.7,
colsample_bytree=0.7,objective='binary:logistic',scale_pos_weight=2)
XGB_after.fit(X_train,y_train)
y_pred_XGB_after_val=XGB_after.predict(X_val)
y_pred_XGB_after_tr=XGB_after.predict(X_train)
f1_xgb_tuned_val=f1_score(y_val,y_pred_XGB_after_val)
f1_xgb_tuned_tr=f1_score(y_train,y_pred_XGB_after_tr)
print('Train F1-Score')
print(f1_xgb_tuned_tr*100)
print('Validation F1-Score')
print(f1_xgb_tuned_val*100)
ADA=AdaBoostClassifier()
ADA.fit(X_train,y_train)
y_pred_ADA_val=ADA.predict(X_val)
y_pred_ADA_tr=ADA.predict(X_train)
f1_ada_val=f1_score(y_val,y_pred_ADA_val)
f1_ada_tr=f1_score(y_train,y_pred_ADA_tr)
print('Train F1-Score')
print(f1_ada_tr*100)
print('Validation F1-Score')
print(f1_ada_val*100)
CB=CatBoostClassifier(max_depth=5)
CB.fit(X_train,y_train)
y_pred_CB_val=CB.predict(X_val)
y_pred_CB_tr=CB.predict(X_train)
f1_cb_val=f1_score(y_val,y_pred_CB_val)
f1_cb_tr=f1_score(y_train,y_pred_CB_tr)
print('Train F1-Score')
print(f1_cb_tr*100)
print('Validation F1-Score')
print(f1_cb_val*100)
LGBM=LGBMClassifier()
LGBM.fit(X_train,y_train)
y_pred_LGBM_val=LGBM.predict(X_val)
y_pred_LGBM_tr=LGBM.predict(X_train)
f1_lgbm_val=f1_score(y_val,y_pred_LGBM_val)
f1_lgbm_tr=f1_score(y_train,y_pred_LGBM_tr)
print('Train F1-Score')
print(f1_lgbm_tr*100)
print('Validation F1-Score')
print(f1_lgbm_val*100)
xgb=XGBClassifier(learning_rate=0.1,n_estimators=150,max_depth=5,min_child_weight=5,gamma=0.3,nthread=8,subsample=0.8,
colsample_bytree=0.8,objective='binary:logistic',scale_pos_weight=3,seed=12)
gb=GradientBoostingClassifier(n_estimators=300,max_features=0.9,learning_rate=0.25,max_depth=4,
min_samples_leaf=2,subsample=1,verbose=0,random_state=12)
lr=LogisticRegression()
blended_classifier=StackingClassifier(classifiers=[xgb,gb],
meta_classifier=lr)
blended_classifier.fit(X_train,y_train)
y_pred_BC_val=blended_classifier.predict(X_val)
y_pred_BC_tr=blended_classifier.predict(X_train)
f1_bcs_tuned_val=f1_score(y_val,y_pred_BC_val)
f1_bcs_tuned_tr=f1_score(y_train,y_pred_BC_tr)
print('Train F1-Score')
print(f1_bcs_tuned_tr*100)
print('Validation F1-Score')
print(f1_bcs_tuned_val*100)
xgb=XGBClassifier(learning_rate=0.1,n_estimators=150,max_depth=5,min_child_weight=5,gamma=0.3,nthread=8,subsample=0.8,
colsample_bytree=0.8,objective='binary:logistic',scale_pos_weight=3,seed=12)
gb=GradientBoostingClassifier(n_estimators=300,max_features=0.9,learning_rate=0.25,max_depth=4,
min_samples_leaf=2,subsample=1,verbose=0,random_state=12)
evc=EnsembleVoteClassifier(clfs=[xgb,gb],voting='soft')
evc.fit(X_train,y_train)
y_pred_evc_val=evc.predict(X_val)
y_pred_evc_tr=evc.predict(X_train)
f1_evc_tuned_val=f1_score(y_val,y_pred_evc_val)
f1_evc_tuned_tr=f1_score(y_train,y_pred_evc_tr)
print('Train F1-Score')
print(f1_evc_tuned_tr*100)
print('Validation F1-Score')
print(f1_evc_tuned_val*100)
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
Defining Error Metric
from keras import backend as K
def recall_m(y_true,y_pred):
    true_positives=K.sum(K.round(K.clip(y_true * y_pred,0,1)))
    possible_positives=K.sum(K.round(K.clip(y_true,0,1)))
    recall=true_positives / (possible_positives + K.epsilon())
    return recall
def precision_m(y_true,y_pred):
    true_positives=K.sum(K.round(K.clip(y_true * y_pred,0,1)))
    predicted_positives=K.sum(K.round(K.clip(y_pred,0,1)))
    precision=true_positives / (predicted_positives + K.epsilon())
    return precision
def f1_m(y_true,y_pred):
    precision=precision_m(y_true,y_pred)
    recall=recall_m(y_true,y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
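The arithmetic of the Keras metric can be sanity-checked with a NumPy mirror against sklearn on toy labels (a sketch; the epsilon term is omitted since the toy denominators are nonzero, and the real Keras version is computed batch-wise rather than over the full epoch):

```python
import numpy as np
from sklearn.metrics import f1_score

# NumPy mirror of the batch-wise Keras F1 metric defined above.
def f1_np(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    tp = np.sum(np.round(np.clip(y_true * y_pred, 0, 1)))
    precision = tp / np.sum(np.round(np.clip(y_pred, 0, 1)))
    recall = tp / np.sum(np.round(np.clip(y_true, 0, 1)))
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
```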
Preprocessing
# For encoding the Target
ohe=OneHotEncoder(handle_unknown='ignore')
# Fit on Train
OHE=ohe.fit(y_train.values.reshape(-1,1))
# Transform on Validation and Train
OHE_target_train_ann=OHE.transform(y_train.values.reshape(-1,1)).toarray()
OHE_target_val_ann=OHE.transform(y_val.values.reshape(-1,1)).toarray()
# Check the shape
OHE_target_val_ann.shape
classifier_ANN=Sequential()
classifier_ANN.add(Dense(units=10,kernel_initializer='uniform',activation='relu',input_dim=19))
classifier_ANN.add(Dense(units=5,kernel_initializer='uniform',activation='relu'))
classifier_ANN.add(Dense(units=2,kernel_initializer='uniform',activation='softmax'))
classifier_ANN.compile(optimizer='adamax',loss='categorical_crossentropy',metrics=[f1_m])
classifier_ANN.summary()
history=classifier_ANN.fit(X_train,OHE_target_train_ann,validation_data=(X_val,OHE_target_val_ann),batch_size=20,epochs=100)
# Plotting F1 Score and loss curves for train and validation data.
plt.figure(figsize=(10,12))
plt.subplot(221)
plt.title('Loss')
plt.plot(history.history['loss'],label='train')
plt.plot(history.history['val_loss'],label='validation')
plt.legend()
plt.subplot(222)
plt.title('F1 Score')
plt.plot(history.history['f1_m'],label='train')
plt.plot(history.history['val_f1_m'],label='validation')
plt.legend()
plt.show()
# Predicting on train and validation data
y_pred_val_ann=classifier_ANN.predict(X_val)
y_classes_val_ann=y_pred_val_ann.argmax(axis=1)
y_pred_train_ann=classifier_ANN.predict(X_train)
y_classes_train_ann=y_pred_train_ann.argmax(axis=1)
f1_ann_val=f1_score(y_val,y_classes_val_ann)
f1_ann_tr=f1_score(y_train,y_classes_train_ann)
# Printing F1 Scores of train and validation data
print('Train F1-Score')
print(f1_ann_tr*100)
print('Validation F1-Score')
print(f1_ann_val*100)
test_for_prediction=test.drop(columns=['employee_id','age','region'])
test_for_prediction.shape
prediction_DT=classifier_DT_after.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_DT})
submission.to_csv('sub_dt_tuned_trans.csv',index=False)
prediction_RF=classifier_RF.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_RF})
submission.to_csv('sub_rf_trans.csv',index=False)
prediction_BCS=blended_classifier.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_BCS})
submission.to_csv('sub_bcs_trans.csv',index=False)
prediction_EVC=evc.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_EVC})
submission.to_csv('sub_evc_trans.csv',index=False)
prediction_GB=gb.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_GB})
submission.to_csv('sub_gb_trans.csv',index=False)
prediction_XGB=xgb.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_XGB})
submission.to_csv('sub_xgb_trans.csv',index=False)
prediction_CB=CB.predict(test_for_prediction)
submission=pd.DataFrame({"employee_id":test["employee_id"],
"is_promoted":prediction_CB})
submission.to_csv('sub_cb_trans.csv',index=False)
The best F1-score on the test set, 51.82, was achieved by the XGBoost classifier.